Data and Functions

Packages and Functions

library(Rcpp)
library(ggplot2)
library(plotly)
library(data.table)
library(ggpubr)
library(vegan)
library(phyloseq)
# devtools::install_github("schuyler-smith/schuylR")
library(schuylR)
# devtools::install_github("schuyler-smith/ssBLAST")
library(ssBLAST)
setwd('~/google-drive/DARTE/')
get_sample_names <- function(x){gsub('\\.','_',gsub('.blast.*','',colnames(x)))}


Data Files

metadata <- fread('files/DARTE-QM-v3-metadata-fromKathyMou.csv')
  metadata[[1]] <- gsub('\\.', '_', metadata[[1]])
categories <- fread('files/category_all.tsv', header = FALSE)
  colnames(categories) <- c('Source', 'ARG_Class', 'Primer')
sample_metrics <- read.csv('darte-samples_R.csv')
  sample_metrics <- sample_metrics[!(sample_metrics$Sample %in% c('Mock-0-025spike-C_S12', 'Mock-0spike-A_S1')),]


Sequencing Results


Percent Identity

This graph is created by averaging the number of reads across all samples from a single source, e.g. the average number of reads from soil that aligned at each percent identity threshold.
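As a rough illustration of that averaging step, the sketch below groups a long-format table by source and identity threshold. The table `reads`, its column names, and all of its values are hypothetical stand-ins, not the data used in this analysis.

```r
library(data.table)

# Hypothetical long-format table: one row per sample and identity threshold.
# Column names and values are illustrative only.
reads <- data.table(
  Sample   = rep(c("soil_1", "soil_2", "mock_1", "mock_2"), each = 3),
  Source   = rep(c("soil", "mock"), each = 6),
  Identity = rep(c(60, 80, 98), times = 4),
  Aligned  = c(900, 700, 400,  1100, 900, 600,
               5000, 4800, 4500,  7000, 6600, 6100)
)

# Average the aligned-read counts over all samples of a source
# at each percent identity threshold.
mean_reads <- reads[, .(Mean_Aligned = mean(Aligned)), by = .(Source, Identity)]
print(mean_reads)
```

Each row of `mean_reads` would correspond to one point on a per-source line in the graph.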


Relative Abundance

This is the same information as the Percent Identity graph, but represented as the proportion of the total number of reads from the raw FASTQ files.


Percent Aligned

This is the same information as the Percent Identity graph, but represented as the proportion of the total number of aligned reads, i.e. at each increasing percent identity, the number of reads aligned is divided by the number of reads aligned at 60% (the total number of aligned reads).
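The two proportions described in the Relative Abundance and Percent Aligned tabs can be sketched for a single sample as follows. The counts, the `raw_reads` total, and the column names are hypothetical; only the choice of denominators follows the text.

```r
library(data.table)

# Hypothetical counts of aligned reads for one sample at increasing
# percent identity thresholds; 60% is the lowest threshold and is
# treated as the total number of aligned reads.
aligned <- data.table(
  Identity  = c(60, 70, 80, 90, 98),
  N_Aligned = c(2000, 1800, 1500, 1200, 1000)
)
raw_reads <- 8000  # hypothetical number of reads in the raw FASTQ file

# Relative Abundance: share of all raw reads aligned at each threshold.
aligned[, Rel_Abundance := N_Aligned / raw_reads * 100]

# Percent Aligned: share of the reads aligned at 60% that remain.
aligned[, Pct_Aligned := N_Aligned / N_Aligned[Identity == 60] * 100]
print(aligned)
```

By construction, `Pct_Aligned` is 100 at the 60% threshold and can only stay level or fall as the threshold rises.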


A 98% identity match for reads of length >= 100 bp seems to be the best trade-off between the number of reads captured and confidence in the alignment. For subsequent analyses, I used that threshold for determining identified ARGs.
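A minimal sketch of applying that threshold to a table of BLAST hits is shown below. The table `hits` and the column names `pident` and `read_length` are assumptions made for illustration; the actual filtering in this analysis may be done differently.

```r
library(data.table)

# Hypothetical BLAST hits; pident is percent identity and
# read_length is the read length in bp (both names are assumed).
hits <- data.table(
  qseqid      = c("read_1", "read_2", "read_3", "read_4"),
  sseqid      = c("gene_A", "gene_A", "gene_B", "gene_C"),
  pident      = c(99.2, 97.5, 98.0, 100.0),
  read_length = c(150, 140, 100, 80)
)

min_identity <- 98   # percent identity threshold
min_length   <- 100  # minimum read length in bp

# Keep only hits that meet both thresholds.
confident_hits <- hits[pident >= min_identity & read_length >= min_length]
print(confident_hits)
```

Here `read_2` is dropped for falling below 98% identity and `read_4` for being shorter than 100 bp.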


Primer Success

Mock Samples

The rate of success for the primers can be determined by how many successfully amplified their intended target.


length(expected_mock)
## [1] 98
primer_success
## [1] 73
round(primer_success/length(expected_mock) * 100, 1)
## [1] 74.5

Of the 98 genes expected to be found in the Mock communities, 73 were captured by the primers.
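How `primer_success` was computed is not shown here. One plausible way, assuming a vector of detected gene names is available, is to count the expected genes that appear among the detected ones. The vector `detected` and all gene names below are hypothetical.

```r
# Hypothetical gene identifiers; the real expected_mock vector holds 98 genes.
expected_mock <- c("gene_A", "gene_B", "gene_C", "gene_D")

# Hypothetical set of genes detected in the mock samples after filtering.
detected <- c("gene_A", "gene_C", "gene_D", "gene_X")

# Number of expected genes that were actually detected.
primer_success <- sum(expected_mock %in% detected)

# Success rate as a percentage, rounded to one decimal place.
success_rate <- round(primer_success / length(expected_mock) * 100, 1)
print(success_rate)
```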


Overall Success

This metric for success is not necessarily fair: the primers were not designed with the certainty that every target was present in any of the samples. The Mock communities were designed with known members, and so give a better idea of which targets should be amplified.

length(primers)
## [1] 762
primer_success
## [1] 203
round(primer_success/length(primers) * 100, 1)
## [1] 26.6



Unexpected

In addition to the 73 genes that were expected, 63 additional genes were detected that we did not predict.

unexpected
## [1] 63
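A count like this could be obtained as the set difference between detected and expected genes. The sketch below assumes a hypothetical `detected` vector and made-up gene names; it is not necessarily how `unexpected` was derived in this analysis.

```r
# Hypothetical gene identifiers for illustration only.
expected_mock <- c("gene_A", "gene_B", "gene_C", "gene_D")
detected      <- c("gene_A", "gene_C", "gene_D", "gene_X", "gene_Y")

# Genes that were detected but are not in the expected set.
unexpected_genes <- setdiff(detected, expected_mock)

# Number of unexpected genes.
unexpected <- length(unexpected_genes)
print(unexpected)
```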

Unexpected Graph

The most notable are the sulfonamide resistance genes, which we did not expect to be there, as well as aph-6 and lnu_C. The other oddity is how much more abundant these are than the expected genes: a maximum value of 15000, compared to ~2500.



Spike Detection


Bar Graphs


Linear Relation


Table

##                    Sample Spike Spike_Level   Color
##  1:      Q3_Mock_0spike_B     0      0.0000 #757575
##  2:      Q3_Mock_0spike_C     1      0.0000 #757575
##  3: Q3_Mock_0_0025spike_A    20      0.0025 #E69F00
##  4: Q3_Mock_0_0025spike_B    11      0.0025 #E69F00
##  5: Q3_Mock_0_0025spike_C    12      0.0025 #E69F00
##  6:  Q3_Mock_0_009spike_A    37      0.0090 #56B4E9
##  7:  Q3_Mock_0_009spike_B    70      0.0090 #56B4E9
##  8:  Q3_Mock_0_009spike_C    87      0.0090 #56B4E9
##  9:  Q3_Mock_0_025spike_A   223      0.0250 #009E73
## 10:  Q3_Mock_0_025spike_B   159      0.0250 #009E73
## 11:   Q3_Mock_0_12spike_A   634      0.1200 #0072B2
## 12:   Q3_Mock_0_12spike_B   562      0.1200 #0072B2
## 13:   Q3_Mock_0_12spike_C   972      0.1200 #0072B2
## 14:   Q3_Mock_0_25spike_A  2027      0.2500 #D55E00
## 15:   Q3_Mock_0_25spike_B  1675      0.2500 #D55E00
## 16:   Q3_Mock_0_25spike_C  1053      0.2500 #D55E00
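One way to examine the linear relation between spike-in level and detected spike reads is an ordinary least-squares fit on the values in the table. This is only a sketch; the use of `lm()` here is an assumption, and the model behind the Linear Relation plot may differ.

```r
library(data.table)

# Spike_Level and Spike values copied from the table above.
spikes <- data.table(
  Spike_Level = c(0, 0, 0.0025, 0.0025, 0.0025, 0.009, 0.009, 0.009,
                  0.025, 0.025, 0.12, 0.12, 0.12, 0.25, 0.25, 0.25),
  Spike = c(0, 1, 20, 11, 12, 37, 70, 87, 223, 159, 634, 562, 972,
            2027, 1675, 1053)
)

# Ordinary least-squares fit of detected spike reads on spike-in level.
fit <- lm(Spike ~ Spike_Level, data = spikes)

# Intercept and slope, plus the proportion of variance explained.
print(coef(fit))
print(summary(fit)$r.squared)
```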


Normalized Bar Graphs


Normalized Linear Relation


Table

##                    Sample Spike Aligned Count Spike_Level   Color
##  1:      Q3_Mock_0spike_B     0   23795  0.00      0.0000 #757575
##  2:      Q3_Mock_0spike_C     1   42946  0.00      0.0000 #757575
##  3: Q3_Mock_0_0025spike_A    20   55061  0.04      0.0025 #E69F00
##  4: Q3_Mock_0_0025spike_B    11   24207  0.05      0.0025 #E69F00
##  5: Q3_Mock_0_0025spike_C    12   15981  0.08      0.0025 #E69F00
##  6:  Q3_Mock_0_009spike_A    37   29794  0.12      0.0090 #56B4E9
##  7:  Q3_Mock_0_009spike_B    70   44313  0.16      0.0090 #56B4E9
##  8:  Q3_Mock_0_009spike_C    87   45024  0.19      0.0090 #56B4E9
##  9:  Q3_Mock_0_025spike_A   223   37222  0.60      0.0250 #009E73
## 10:  Q3_Mock_0_025spike_B   159   27588  0.58      0.0250 #009E73
## 11:   Q3_Mock_0_12spike_A   634   56530  1.12      0.1200 #0072B2
## 12:   Q3_Mock_0_12spike_B   562   71342  0.79      0.1200 #0072B2
## 13:   Q3_Mock_0_12spike_C   972   70902  1.37      0.1200 #0072B2
## 14:   Q3_Mock_0_25spike_A  2027   85052  2.38      0.2500 #D55E00
## 15:   Q3_Mock_0_25spike_B  1675   54115  3.10      0.2500 #D55E00
## 16:   Q3_Mock_0_25spike_C  1053   27182  3.87      0.2500 #D55E00
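The Count column appears consistent with the spike reads expressed as a percentage of each sample's aligned reads, rounded to two decimals. The sketch below reproduces it for three rows copied from the table; this formula is inferred from the values, not taken from the original code.

```r
library(data.table)

# Three rows copied from the table above.
spikes <- data.table(
  Sample  = c("Q3_Mock_0_0025spike_A", "Q3_Mock_0_025spike_A",
              "Q3_Mock_0_25spike_A"),
  Spike   = c(20, 223, 2027),
  Aligned = c(55061, 37222, 85052)
)

# Spike reads as a percentage of all aligned reads in the sample.
spikes[, Count := round(Spike / Aligned * 100, 2)]
print(spikes)
```

Normalizing this way removes the effect of differing sequencing depth between replicates, which is why the normalized values can differ in rank from the raw spike counts.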


Mock Samples - ARG Amplicon Detection


Each primer/ARG presented is expected to be found, based on the mock.expected.fa file. 75% of the primers expected to bind genes in the mock samples were successful and were detected in the sequencing. 87% of the ARGs expected to be found in the mock samples were found.


ARGs in Mock

Reads were aligned with BLAST against mock.expected.fa, and the hits were compared to the genes in that file to see how many of the known ARGs are detected. Something is wrong with Q3_Mock_0_025spike_C.

By Primer

Same data, but by individual primer presence.


Sample Ordination


Non-Normalized PCA

Distances were calculated from raw read counts, for comparison with how well normalizing by relative abundance clusters the replicates. Without housekeeping genes, we can’t directly compare samples to each other in terms of abundance or expression of certain genes. With relative abundance, though, we should be able to compare the ratios.


Normalized PCA

Using relative abundance to normalize samples clusters the replicates more tightly, indicating that it works well overall as a normalization method.
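A minimal sketch of the comparison between raw and relative-abundance PCA is given below, using base R's `prcomp()` on a simulated count matrix. The matrix, its dimensions, and the simulated depth differences are all hypothetical, and the ordination in this analysis may have been produced with other tools.

```r
# Hypothetical count matrix: rows are samples, columns are ARGs.
set.seed(1)
counts <- matrix(rpois(6 * 5, lambda = 50), nrow = 6,
                 dimnames = list(paste0("sample_", 1:6),
                                 paste0("gene_", 1:5)))

# Simulate unequal sequencing depth: every second sample has 3x the reads.
counts <- counts * c(1, 3, 1, 3, 1, 3)

# Relative abundance: divide each sample's counts by its own total.
rel_abund <- counts / rowSums(counts)

# PCA on raw counts and on relative abundance for comparison.
pca_raw  <- prcomp(counts)
pca_norm <- prcomp(rel_abund)

# Sample coordinates on the first two principal components.
print(pca_raw$x[, 1:2])
print(pca_norm$x[, 1:2])
```

In the raw-count PCA the first component is expected to largely track sequencing depth, whereas dividing by row totals removes that effect before the ordination.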


NMDS


Hierarchical Clustering


Bray-Curtis


Jaccard


Euclidean


Gower


K-Means Clustering


Optimize K


K(4)



K(6)



K(15)



Sample Composition


ARG Profile



NGS vs PCR